Benchmarking Data Curation Systems

Authors

  • Patricia C. Arocena
  • Boris Glavic
  • Giansalvatore Mecca
  • Renée J. Miller
  • Paolo Papotti
  • Donatello Santoro
Abstract

Data curation includes the many tasks needed to ensure data maintains its value over time. Given the maturity of many data curation tasks, including data transformation and data cleaning, it is surprising that rigorous empirical evaluations of research ideas are so scarce. In this work, we argue that thorough evaluation of data curation systems poses several major obstacles that need to be overcome. First, we consider the outputs generated by a data curation system (for example, an integrated or cleaned database or a set of constraints produced by a schema discovery system). To compare the results of different systems, measures of output quality should be agreed upon by the community and, since such measures can be quite complex, publicly available implementations of these measures should be developed, shared, and optimized. Second, we consider the inputs to the data curation system. New techniques are needed to generate and control the metadata and data that are the input to curation systems. For a thorough evaluation, it must be possible to control (and systematically vary) input characteristics such as the number of errors in data cleaning or the complexity of a schema mapping in data transformation. Finally, we consider benchmarks. Data and metadata generators must support the creation of reasonable gold-standard outputs for different curation tasks and must promote productivity by enabling the creation of a large number of inputs with little manual effort. In this work, we overview some recent advances in addressing these important obstacles. We argue that evaluation of curation systems is itself a fascinating and important research area, and we challenge the curation community to tackle some of the remaining open research problems.
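To make the abstract's points about controllable inputs and gold-standard outputs concrete, the following is a minimal sketch for the data cleaning case. It is an illustration under stated assumptions, not the authors' generator or any existing benchmark tool: the function names `inject_errors` and `repair_quality`, the `#err` corruption marker, and the choice of cell-level precision and recall as the quality measure are all hypothetical.

```python
import random


def inject_errors(clean_rows, error_rate, seed=0):
    """Corrupt a controlled fraction of cells in a clean table.

    Hypothetical helper: clean_rows is a list of dicts (one per row).
    Returns (dirty_rows, error_cells), where error_cells is the set of
    (row_index, column) pairs that were corrupted.
    """
    rng = random.Random(seed)  # seeded so the same input can be regenerated
    cells = [(i, col) for i, row in enumerate(clean_rows) for col in row]
    n_errors = round(error_rate * len(cells))
    error_cells = set(rng.sample(cells, n_errors))
    dirty_rows = [dict(row) for row in clean_rows]  # leave the gold standard intact
    for i, col in error_cells:
        # Assumed corruption: append a marker so the value always changes.
        dirty_rows[i][col] = f"{dirty_rows[i][col]}#err"
    return dirty_rows, error_cells


def repair_quality(clean_rows, dirty_rows, repaired_rows):
    """Cell-level precision and recall of a repair against the gold standard."""
    all_cells = [(i, col) for i, row in enumerate(clean_rows) for col in row]
    # Cells the cleaning system modified.
    changed = {(i, c) for i, c in all_cells if repaired_rows[i][c] != dirty_rows[i][c]}
    # Modified cells that now match the gold standard.
    correct = {(i, c) for i, c in changed if repaired_rows[i][c] == clean_rows[i][c]}
    # Cells that were actually erroneous in the input.
    errors = {(i, c) for i, c in all_cells if dirty_rows[i][c] != clean_rows[i][c]}
    precision = len(correct) / len(changed) if changed else 1.0
    recall = len(correct) / len(errors) if errors else 1.0
    return precision, recall
```

Because the clean table doubles as the gold standard, varying `error_rate` and `seed` yields many test inputs with no manual labeling, which is the productivity property the abstract asks of generators.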

Similar resources

Study of the foundation, models and issues of research data curation and management in scientific and academic environments

Background and Aim: The purpose of this paper is to study, identify, and discuss the foundations and concepts, models and frameworks, and dimensions and challenges of research data curation and management in scientific and academic environments. Method: This is a review article, and the library method was used to collect scientific and research texts in this field. In this research, external an...

Drosophila DNase I footprint database: a systematic genome annotation of transcription factor binding sites in the fruitfly, Drosophila melanogaster

Despite increasing numbers of computational tools developed to predict cis-regulatory sequences, the availability of high-quality datasets of transcription factor binding sites limits advances in the bioinformatics of gene regulation. Here we present such a dataset based on a systematic literature curation and genome annotation of DNase I footprints for the fruitfly, Drosophila melan...

Advancing Geospatial Data Curation*

Digital curation is a new term that encompasses ideas from established disciplines: it defines a set of activities to manage and improve the transfer of the increasing volume of data products from producers of digital scientific and academic data to consumers, both now and in the future. Research topics in this new area are in a formative stage, but a variety of work that can serve to advance t...

A Pipeline for Post-Crisis Twitter Data Acquisition

Due to instant availability of data on social media platforms like Twitter, and advances in machine learning and data management technology, real-time crisis informatics has emerged as a prolific research area in the last decade. Although several benchmarks are now available, especially on portals like CrisisLex, an important, practical problem that has not been addressed thus far is the rapid ...

Mouse Phenome Database: an integrative database and analysis suite for curated empirical phenotype data from laboratory mice

The Mouse Phenome Database (MPD; https://phenome.jax.org) is a widely used resource that provides access to primary experimental trait data, genotypic variation, protocols and analysis tools for mouse genetic studies. Data are contributed by investigators worldwide and represent a broad scope of phenotyping endpoints and disease-related traits in naïve mice and those exposed to drugs, environme...

Journal title:
  • IEEE Data Eng. Bull.

Volume 39  Issue 

Pages  -

Publication date 2016